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Abstract Direct numerical simulation of three-dimensional incompressible flows 
at high Reynolds number using the unsteady Navier-Stokes equations is challeng- 
ing. In order to obtain accurate simulations, very fine meshes are necessary, and 
such simulations are increasingly important for modern engineering practices, such 
as understanding the flow behavior around high speed trains, which is the target 
application of this research. To avoid the time step size constraint imposed by 
the CFL number and the fine spacial mesh size, we investigate some fully implicit 
methods, and focus on how to solve the large nonlinear system of equations at 
each time step on large scale parallel computers. In most of the existing implicit 
Navier-Stokes solvers, segregated velocity and pressure treatment is employed. In 
this paper, we focus on the Newton-Krylov-Schwarz method for solving the mono- 
lithic nonlinear system arising from the fully coupled finite element discretization 
of the Navier-Stokes equations on unstructured meshes. In the subdomain, LU or 
point-block ILU is used as the local solver. We test the algorithm for some three- 
dimensional complex unsteady flows, including flows passing a high speed train, 
on a supercomputer with thousands of processors. Numerical experiments show 
that the algorithm has superlinear scalability with over three thousand processors 
for problems with tens of millions of unknowns. 
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1 Introduction 


Because of the advancement of supercomputing, parallel computational fluid dy- 
namics has become an enabling technology supporting a wide range of applications 
in science and engineering. For example, simulations of flow around an object with 
a complicated shape are particularly useful in aerodynamic vehicle design. Many 
engineering and atmospheric flows are turbulent, and understanding the behav- 
ior of the flows is of great practical importance. Roughly speaking, turbulence 
simulation methodologies can be classified into Reynolds-averaged Navier-Stokes 
approaches (RANS), large-eddy simulation (LES), and direct numerical simula- 
tion (DNS). RANS is based on the Reynolds decomposition of the flow variables 
into their time-averaged and fluctuating quantities. We refer interested readers to 
[?] for details. LES is intermediate between RANS and DNS, which ignores the 
small turbulent scales and computes the dynamics of the large scales [?]. LES was 
the most popular technique for the turbulence simulations in the last few decades 
[?,?,?]. DNS is the most accurate method for the numerical simulation of turbulent 
flows, and is also the most expensive one in terms of the computational cost. In 
DNS, the Navier-Stokes equations are numerically solved without any turbulence 
model, which means that the momentum equation of the Navier-Stokes equations 
must be exactly solved. DNS is an useful tool in fundamental research on tur- 
bulence, and using DNS it is possible to perform certain “experiments” that are 
difficult or sometimes impossible to obtain in actual experiments. Some reviews 
about the DNS can be found in the references [?,?,?]. In this study, we focus on 
studying 3D incompressible flows around complex bodies. A key motivation for 
the current work is to simulate unsteady realistic flow around a high speed train. 
Because of the lack of parallel scalability, commercial CFD software packages can 
only be used when the mesh is not too fine, and thus don’t offer sufficient accuracy. 

In DNS, in order to obtain sufficiently accurate solutions, small spatial scales 
of the complex flows must be resolved by the computational mesh, thus the com- 
putation is usually very demanding, requiring large scale parallel computers for 
their memory capacity and processing speed. It is clear by now that the increase 
of computing power is no longer from faster processors, but from the increase of 
the number of processors. This makes the scalability of the algorithm more im- 
portant than ever. Many researchers have studied parallel algorithms for DNS. 
For examples, Yokokawa et al. [?] studied DNS of incompressible turbulence flows 
with the Fourier spectral method using over 4000 processors; Chen [?] studied 
DNS of chemistry turbulence on a Petascale supercomputer using a finite differ- 
ence method for the spatial discretization and an explicit Runge-Kutta method 
for the temporal discretization; Rahimian et al. [?] studied DNS of blood flows 
using nearly 200 thousand processor cores with over 90 billion unknowns. 

In this work, we present a Newton-Krylov-Schwarz (NKS) based parallel im- 
plicit solver for the unsteady incompressible Navier-Stokes equations. NKS is a 
general purpose parallel solver for nonlinear systems and has been widely used to 
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solve different kind of nonlinear problems; see e.g., [?,?,?,?]. Our algorithm begins 
with a discretization of the incompressible unsteady Navier-Stokes equations on an 
unstructured tetrahedron mesh with a stabilized finite element method in space 
and a fully implicit backward difference scheme in time. At each time step, an 
inexact Newton method is employed to solve the discretized large sparse nonlinear 
system and in the Newton steps, a domain decomposition preconditioned Krylov 
method is used to solve the Jacobian system which is constructed analytically in 
order to obtain the desired performance. The most important component of the 
solver is the monolithic Schwarz preconditioner that keeps the strong coupling 
of the velocity and pressure variables at each mesh point throughout the entire 
algorithm. 

The rest of the paper is organized as follows. In Section 2, we briefly intro- 
duce the governing equations, and the discretization of the governing equations is 
discussed in Section 3. In Section 4, the Newton-Krylov-Schwarz algorithm is in- 
troduced, and some numerical results are presented in Section 5. Some concluding 
remarks are given in Section 6. 


2 Governing equations 


The Navier-Stokes equations are the fundamental governing equations that de- 
scribe the flow of a viscous fluid. In this paper, the incompressible unsteady Navier- 
Stokes equations are used to model the flow. Let 2 C R? be the spatial domain 
of interest bounded by the boundary I = Tiniet U Mwai U Toutiet. The equations 
read as, in the vector form: 


p(Gtu-vu)-v-o=Fing, (1) 
V-u=O0in 2, 


where u is the velocity, o = —pI + (Vu + (Vu)”) is the Cauchy stress tensor, p 
is the pressure, p is the fluid density, jz is the dynamic viscosity, and f represents 
the body force or the source term. A given velocity profile g is chosen on the inlet 
boundary [jniez, no-slip boundary conditions are used on the wall Ii; and on 
the outlet boundary outlet the stress-free boundary conditions are imposed: 


u=g on Iinlet, 
u=0 on Toat (2) 


o:-n=0 on Toutlet, 


where n is the outward unit normal vector on the domain boundary. The initial 
condition for the velocity is specified as: 


u=uo in Q at t=0. (3) 


Here uo is a given function. 
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3 Fully-implicit finite element discretization 


We use a P1 — P1 finite element method to discretize the Navier-Stokes equations 
(??) in the spatial domain. To describe the finite element method, we first define 
the trial and weighting function spaces as 


U = {u(.,t) | u(t) € [H (Q),  ul-,t)=g on Tinet}, 

Uo = {u(-,t) | u(-,t) € [H'(2)]°, u(-,t) =O on Timer UT uant 

P = {p(-,t) | p(-,t) € L7(2)}. 
Then, the weak form of the Navier-Stokes equations takes the form: Find u € U, 
p € P such that 


ðu 


o OF odo +u f Vu: Ved? +p | (u -V)u- &dQ 


- | vv-san+ | (v-wyae= | £- oan, (4) 


holds for all € Uo and y E P. 

The finite element discretization begins with meshing the computational do- 
main with an unstructured tetrahedral mesh 7” = {K}. The finite dimensional 
trial and weighting spaces can then be established as 


= {u"(.,¢ (t -Sot u’( t)=g on Pinet}, 


uğ = = fu" (.,t (nt = Sat u"(-, t) =0 on Tinlet U Dwait}, 


N, 


p” = {p"(-,t) |p” t) = pepe (-,t)}, 
i=1 

where u? € R?, p? € R are the nodal values of the velocity and pressure functions. 
Nu and Np are the number of nodes for velocity and pressure, respectively. Each 
of the three components of &” and vy! are the basis functions which are piecewise 
linear functions. Since the P1 — P1 element does not satisfy the Ladyzenskaja- 
Babuska-Brezzi (LBB) condition, we need to add suitable stabilization terms to 
fulfill the LBB condition. For this purpose, we employ the stabilization technique 
introduced in publications [?,?]. The semi-discrete stabilized finite element formu- 
lation of (??) is given as follows: Find u” € U”, p € P” such that 


p ðu” aan +u | Vu": Ve "ante f lv V)u" .o' dn 
o ot Q 
- | viv stan f w: uja + Y (Vut, rev- p" 
AP 


KET” 


- 

ðu” h h h h h h 

XO (5 + (u. Vu” + Vp", mhu” -VD + Ve") 
K 


=| fdt Y (f, mlu” Vo +V) , 8) 
Q K 
KET” 
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holds for all 6” € UË and y” € P”. The underlined terms are the stabilization 
terms where the parameters Te and Tm are defined as follows: 


1 


4 H . ; 


—— 
8Tmtr(G) 


Tm 
Te = 


Here Gij = SS gat Set is the covariant metric tensor and ge represents the 
Jacobian of the mapping between the reference and the physical element. 
We use an implicit backward finite difference formula with a fixed time step 


size, At, to discretize (??) in time. For a given semi-discretized system 


dx 
ay fe 
"a 
the formula is defined as 
gr — xr! 
z= L de 
Ai (z") 


At the nt” time step, we need to solve a nonlinear system 
F"(X") = 0, (6) 


with the initial guess X"~' (the solution of the (n—1)*” time step), to obtain the 
solution of the nt” time step X”, which is the nodal values of the velocity and pres- 
sure. The ordering of the nodal values and the corresponding nonlinear functions 
is not important for the accuracy of the solution, but is very important for the 
convergence of the algebraic solver and also the performance of the solver in terms 
of the computing time. In most existing approaches, the field-by-field ordering is 
often used, as a result, the Jacobian of the nonlinear system has a saddle point 
structure, which plays a key role in the design of the iterative method and its pre- 
conditioner. We do not use the field-by-field ordering. We order the variables and 
functions element by element and in each element, the variables are ordered node 
by node. This ordering helps constructing the point-block incomplete LU factor- 
ization that is more stable than the classical pointwise ILU, and also improving the 
cache performance and the parallel efficiency in load and communication balance. 


4 Monolithic Newton-Krylov-Schwarz algorithm 


In most Navier-Stokes solvers, such as the projection methods, the operator is split 
into the velocity component and pressure component, and the algorithm takes the 
form of a nonlinear Gauss-Seidel iteration with two large blocks. In the monolithic 
approach that we consider in this paper, the velocity and pressure variables of a 
grid point stay together throughout the computation. In this approach, the two 
critical ingredients are the monolithic Schwarz preconditioner, and the point-block 
ILU based subdomain solver. 
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The nonlinear system (??) is solved by a Newton-Krylov-Schwarz method 
which uses an inexact Newton method [?] as the nonlinear solver, a Krylov sub- 
space method (GMRES) [?] as the linear solver at each Newton step, and an over- 
lapping Schwarz method [?] as the preconditioner. The framework of the Newton- 
Krylov-Schwarz method reads as 
Algorithm NKS 


Step 1. Use the solution of the previous time step as 
the initial guess X? = X”! 
Step 2. Fork =0,1,--- until convergence 
e Construct the complete Jacobian matrix Jj; 
e Solve the following right-preconditioned Jacobian system 


inexactly by a Krylov subspace method 
Ji (Mi)! MES; = —F"(Xz) (7) 
e Do a cubic line search to find a step length TẸ 


o Set Xky1 = Xk + Tk Sk 
Here Jy is the full Jacobian of F” (X) at point X%, including the stabilization 
terms, Mj is an additive Schwarz preconditioner to be introduced shortly. The 
inexactness mentioned in Step 2 means that the accuracy of the solution to the 
Jacobian system (??) is in the sense of 


|| Je (Mg) M; Sk + F"(X2) ||< nb || F” (XE) Il, (8) 


where nj, is the relative tolerance for the linear solver. For simplicity, we ignore 
the scripts n and k for the rest of the paper. 

In NKS, the most difficult and time-consuming step is the solution of the 
large, sparse, and nonsymmetric Jacobian system (??) by a preconditioned GM- 
RES method. In the Jacobian solver, the most important component is the pre- 
conditioner; without which GMRES method doesn’t converge or converges very 
slowly, and a good choice of preconditioner accelerates the convergence signifi- 
cantly. Let np be the number of processors of the parallel machine. In this paper, 
we use an overlapping restricted additive Schwarz preconditioner [?], where we 
first partition the computational domain (2 into np nonoverlapping subdomains 
QQ, (L = 1,-++,np) and then extend each subdomain N, to an overlapping sub- 
domain 2 by including 6 layers of elements belonging to its neighbors. In each 
overlapping subdomain, we define a local Jacobian matrix J; which is the restric- 
tion of the global Jacobian matrix J to RÊ with the restriction operator R?. R 
is a matrix which maps the global vector of unknowns to those belonging to NR? 
by simply extracting the unknowns that lie inside the subdomain. In practice, J; 
is obtained by taking the derivatives of the discretized Navier-Stokes equations 
(??) in RË with homogeneous Dirichlet boundary conditions in the interior of 
N, and the physical boundary conditions on 092. The restricted additive Schwarz 
preconditioner is defined as the summation of the local preconditioners B‘ of Ji: 


Maras = X (R?) B Rr, (9) 
l=1 
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where the restriction operator R? is defined as the restriction to the unknowns 
in the non-overlapping subdomain §2;. In practice, we only need the application 
of B; l to a given vector, which can be obtained by solving a subdomain linear 
system. Since B7‘ is used as a preconditioner here, the subdomain linear system 
can be solved exactly or approximately. Both approaches are studied in this paper. 
LU factorization is computationally expensive and requires large memory resources 
when the local matrix J; is large. An economical alternative is the incomplete 
LU factorization (ILU) [?,?] which reduces the computation by dropping some 
fill-in elements in predetermined nondiagonal positions that are generated during 
the factorization process. In this paper, we use a point-block ILU as the local 
preconditioner, where we group all physical components associated with a mesh 
point as a block and always perform an exact LU factorization for this small block, 
in addition, the velocity and pressure variables associated with a given mesh point 
is either kept or dropped together. The effectiveness and the computational cost 
of the subdomain preconditioner depend on the number of elements dropped. 


5 Numerical experiments 


In this section, we report some numerical results of the proposed algorithm. Our 
solver is implemented on top of the Portable Extensible Toolkit for Scientific com- 
putation (PETSc) [?]. Even though, most components of our discretization scheme 
have been studied by others, but the overall finite element scheme is new. To test 
its correctness, we first simulation a flow around a cylinder. The second test case 
represents our target application. The unstructured tetrahedral meshes for the 
first test case are generated with CUBIT [?] from Sandia National Laboratory 
and the geometry of the second test case is generated with AutoCAD and meshed 
by using ANSYS. The mesh partitions for the additive Schwarz preconditioner are 
obtained with ParMETIS [?] of University of Minnesota. The results showed in 
this section are obtained on the Dell PowerEdge C6100 Cluster at the University 
of Colorado Boulder. The stopping conditions for the nonlinear and linear solver 
are that when the residuals of the nonlinear and linear equations are reduced by 
a factor of 1071? and 107%, respectively. 

In the simulations, a very important parameter is the Reynolds number (Re) 
which is a dimensionless number that determines the ratio of inertial forces to 
viscous forces. Usually, a low Reynolds number implies a laminar flow and a high 
Reynolds number corresponds to a turbulent flow. The Reynolds number is defined 
as 

A (10) 
H 
where U is the mean velocity of the object relative to the fluid. In this paper, we 
choose ū as the mean velocity at the inlet boundary. L is the characteristic length 
and it is the diameter of the obstacle in this paper. 


5.1 Benchmark problem 


We first test the algorithm for a well-understood benchmark problem, flow around 
a cylinder, defined in [?,?]. The detailed geometry of the problem is shown in 
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Fig. 1 Flow passing a cylinder in a channel 


Fig. ??. As suggested in [?], two important features of this flow are the drag and 
the lift coefficient of the cylinder. The definitions of these two coefficients read as 


2Fi 


Lift = ——_ 
and Lift pU2,DE’ 


Drag = (11) 


2Fu 
pU2,DH 


respectively, where Fy and F; are defined as 


Out Out 
_ =e F=- a E . (12 
Fa [ (on gn TPN ) dS and F; (on an” pry) dS. (12) 


Here S is the surface of the cylinder, H = 0.41m is the height of the channel, 
D = 0.1m is the diameter of the cylinder, nz and ny are the normal vectors, uz is 
the tangential velocity of u with t = (ny, —nz,0) and Um is the maximal inflow 
Uin = (Uin, Vin, Win) ([7]) with 


H — y)\(H — 


In this test case, the kinematic viscosity y = 107°m?/s and the density p = 
lkg/m°. The drag and lift coefficients of the cylinder are shown in Fig. ??. These 
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0 2 4 6 8 0 2 4 6 8 
Time Time 


Fig. 2 Drag coefficient (left) and lift coefficient (right) for the laminar flow. 


Table 1 Number of iterations and total compute time for a laminar flow on a mesh with 
1.6 x 107 elements (mesh size h ~ 0.001) using different time step sizes. The computations are 
carried with 2400 processors. 


At Newton GMRES_ Time 


0.01 3.0 102.2 106.5 
0.05 3.0 114.6 137.2 
0.1 3.0 120.0 112.0 
0.5 4.1 137.3 179.7 
1.0 4.9 147.5 241.7 


results are obtained on a mesh with about 1.6 x 10” elements (total degrees of 
freedom DOF = 1.1 x 10’) and a fixed time step At = 0.01s with the zero 
initial condition and zero body force. The maximal drag coefficient Dragmaz = 
3.2507, minimal lift coefficient Liftmin = —0.011427 and maximal lift coefficient 
Liftmaz = 0.002744. Another important parameter to compute is the difference 
of the pressure at the final time t = 8s between the front and back of the cylinder, 
and the value is —0.1131 in our test case and it agrees well with the results of 
[?,?]. 

We next study the numerical performance of the algorithm. Since we use a 
fully implicit method in which the time step size is no longer constrained by the 
Courant-Friedrichs-Lewy (CFL) condition, we can use very large time step size. 
Table ?? shows the number of linear and nonlinear iterations and the compute time 
of the fully implicit method with respect to different time step sizes. From this 
table, we see that the algorithm is stable, and converges quite well with different 
time step size. The parallel performance of the algorithm is shown in Table ??. This 
table shows that when we increase the number of processors, the average number 
of Newton iterations (Newton) per time step does not change, the average number 
of GMRES iterations per Newton step (GMRES) increases reasonably, and the 
compute time per time step decreases quickly. The left figure of Fig. ?? shows the 
speedup of the algorithm for this problem and it indicates that the algorithm has 
a superlinear speedup. The blue line refers to the linear speedup which means that 
the compute time is exactly halved when the number of processors is doubled, and 
the red line is the actual speedup of the algorithm. The right figure of Fig. ?? is 
the average compute time per time step with respect to the number of processors. 
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Table 2 Parallel performance of the algorithm for the laminar flow simulation. Here DOF = 
1.1 x 107 and Re = 20. 


Np Newton GMRES Time 


1024 3.1 61.5 597.5 
1536 3.1 66.6 272.0 
2048 3.1 80.3 180.1 
3072 3.1 105.4 101.6 


We note that the number of Newton iterations is small in the test cases. This is 
because the full Jacobian is used in the algorithm. If the derivative with respect 
to some of the stabilization terms are dropped, the number of Newton iterations 
increases. 


6 597. 
5 
4 O 
5 = 272 
k g 
Q =) 
2 Q 
s E 180.1 
(ð) 
= |deal 
=+ Actual 
101 § 
1 1536 2048 3072 1024 1536 2048 2 
Number of processors Number of processors 


Fig. 3 The speedup and the average compute time per time step (log-log scaled) for the flow 
passing a cylinder simulation. Here DOF = 1.1 x 10’ and Re = 20. 


5.2 High speed train simulation 


In this section, we study a flow passing a high speed train with a realistic train ge- 
ometry, and a realistic Reynolds number. Many engineering and safety problems 
are being raised with the rapid development of high speed rail transportation. 
The wind conditions around a train body influence the stability of the train sig- 
nificantly. A thorough understanding of the wind around the train is extremely 
important in the shape design of the train, and also in the control of the train 
under different weather conditions, etc. The computation of this 3D problem is 
very demanding because of the large computational domain, the complexity of the 
geometry and the ill-conditionness of the discretized mathematical model. 

A realistic three dimensional train model with two cars is considered in this 
paper. The geometry is created by AutoCAD and the flow domain is meshed by 
ANSYS; see Fig. ?? for the details of the train model, the computational domain 
and a local view of the computational mesh. Standard flow parameters include the 
dynamic viscosity u = 1.831 x 107%kg/(m - s) and the density p = 1.185kg/m?. 
The boundary conditions are defined as follows: a time-varying boundary condition 
Uin = (0, vin - t,0) is employed on the inlet Dinter (when vin: t > 100, let uin = 
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(0, 100, 0)) where vin is a constant and t is a time, stress free boundary condition 
(o-n = 0) is used on the outlet boundary Toutiet and a no-slip boundary condition 
(u = 0) is given on the wall boundary Iwai (all the surfaces except Iiniet and 
Iouttet). The zero initial condition and zero body force are used for this test case. 
We assume the velocity of the train is 360km/h, that is, vin = 100m/s for the 
inlet boundary condition. The Reynolds number for this test case 


3. n 
Re = Clink _ 1.185kg/m* -100m/s-3m _, g, 107, 
u 1.831 x 10-5kg/(m- s) 


where the characteristic length L is chosen as the height of the train. 


outlet 


inlet 


Fig. 4 Model for the simulation of the flow around a high speed train (top) and the mesh 
around the train (bottom). Here the size of the box (top left) is 140m x 23m x 9m and the 
dimension of the train (top right) is 19m x 3m x 2m. 


The velocity magnitudes distribution around the train at t = 1.0s (time step 
size At = 0.01) is shown in Fig. ?? and Fig. ??. From these figures, we see that 
the flow is very complicated at the end of the train and under the train, and more 
details can be viewed in the stream trace figures Fig. ?? and Fig. ??. 

To investigate the parallel performance and parallel scalability of the algorithm 
for this complicated problem, we choose two meshes, one with about 1.1 x 107 
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Fig. 5 The velocity distribution around the train at t = 1.0s. Here vin = 100m/s, p = 
1.83 x 10~-5pa-s, Re = 107. 


Table 3 Parallel performance of the high speed train simulation. Here ô = 6 and Re = 107. 


DOF =8 x 10° DOF = 1.7 x 10° 
np Newton GMRES Time Newton GMRES Time 
1024 4.0 91.4 640.3 4.0 55.3 1501.2 
1536 4.0 113.5 373.1 4.0 59.6 823.0 
2048 4.0 136.2 266.8 4.0 62.1 420.6 
3072 4.0 152.6 173.7 4.0 66.3 273.1 


elements (8 million degrees of freedom (DOF)), and the other with about 2.2 x 
10’ elements (17 million DOF). Table ?? shows the average number of Newton 
iterations per time step, the average number of GMRES iterations per Newton 
step and the average compute time per time step. The results are averaged values 
over the first 10 time steps. From this table, we see that the number of Newton 
iterations does not change and the average number of GMRES iterations increases 
mildly as the number of processors increases. The compute time is more than 
halved as the number of processors is doubled, which means that the algorithm 
has a superlinear speedup. The superlinear speedup is also shown in the left figure 
of Fig. ??. The right figure of Fig. ?? is the average compute time per time step 
with respect to the number of processors. 
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10 20 30 40 50 60 70 80 90 100 110 120 


Fig. 6 The velocity distribution at the end of the train at t = 1.0s. Here vin = 100m/s, 
u = 1.83 x 10-5pa-s, Re = 107. 


Table 4 Comparison of different local solver for the high speed train simulation. Here DOF = 
1.7 x 107 and Re = 10". The fill-in level of the ILU is 4. 


Newton GMRES Time 
LU ILU(4) LU ILU(4) LU ILU(4) 
1024 4.0 4.0 55.3 73.6 1501.2 395.2 
1536 4.0 4.0 59.6 79.8 823.0 327.1 
2048 4.0 4.0 62.1 82.0 420.6 239.5 
3072 4.0 4.0 66.3 89.5 273.1 182.7 


In these experiments just shown, the subdomain problems are solved by LU 
factorization which is expensive in both computation and memory requirement. As 
we mentioned in Section 4, an alternative approach is point-block ILU. We show 
a comparison of the different subdomain solvers in Table ?? for this test case, and 
this table reveals that while ILU takes more GMRES iterations than LU, it takes 
much less compute time than LU, especially when the number of processors is 
small. 

For the overlapping Schwarz preconditioner, an important parameter that in- 
fluences the strength of the preconditioner is the overlapping size 6. From the 
theory of the overlapping Schwarz method, larger overlap often implies a faster 
convergence (fewer GMRES iterations), at least for elliptic systems with sufficient 
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Fig. 7 The stream trace of the flow at the end of the train at t = 1.0s. Here vin = 100m/s, 


u = 1.83 x 10-5 pa-s, Re = 107. 


Table 5 The effect of various choices of the overlapping parameter 6 for the high speed train 


simulation. Here DOF = 8 x 10° and Re = 10”. 


ô Np Newton GMRES Time 
2 1024 4.0 209.6 338.0 
4 1024 4.0 131.3 410.1 
6 1024 4.0 91.4 510.6 
2 2048 4.0 301.5 133.8 
4 2048 4.0 169.2 175.7 
6 2048 4.0 136.2 262.7 


regularity [?]. However, larger overlap also means larger subdomain problems and 
more information transfer between subdomains, as a result, the overall compute 
time may increase. Table ?? shows the effect of the overlapping parameter for 
solving the high speed train simulation problem. The best result is obtained with 
6 = 2. For 6 = 0, 1, the preconditioner is so weak and the algorithm doesn’t 


converge for some cases. 


Besides the parallel performance and scalability, the robustness with respect 
to the Reynolds number Re is another important consideration in the design of 
solution algorithms for the flow simulation problems. Table ?? shows that the 
algorithm is quite robust for a wide range of Reynolds numbers. 
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Fig. 8 The stream trace of the flow around the wheels of the train at t = 1.0s. Here vin = 
100m/s, u = 1.83 x 10-5 pa-s, Re = 107. 
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Fig. 9 The speedup and the average compute time per time step (log-log scaled) for the high 
speed train simulation. Here DOF = 8 x 10° and Re = 10’. 


6 Concluding remarks 


A domain decomposition based parallel algorithm for the direct numerical simula- 
tion of complex flows is introduced and studied in this paper. The algorithm begins 
with a fully implicit discretization of the unsteady incompressible Navier-Stokes 
equations on an unstructured mesh with a stabilized finite element method, then 
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Table 6 The robustness of the algorithm with respect to the Reynolds number Re for the 
train simulation. Here DOF = 8 x 10°. 


Re Np Newton GMRES Time 
1.0 x 10° 1024 3.0 109.1 390.4 
1.0x 108 1024 3.0 99.5 385.5 
1.0x 107 1024 4.0 91.4 512.7 
1.0 x 10° 2048 3.0 137.7 195.4 
1.0x 108 2048 3.0 129.9 195.1 
1.0x 107 2048 4.0 136.2 261.1 


an inexact Newton method is employed to solve the large nonlinear system at each 
time step, and a preconditioned GMRES method is employed to solve the linear 
Jacobian system in each Newton step with a one-level additive Schwarz precondi- 
tioner. We tested the algorithm for a benchmark problem and a high speed train 
simulation problem with more than 17 million degrees of freedom. The numerical 
experiments showed that the method has a superlinear speedup with up to 3072 
processors. 
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